Practical: epidemic curves, delays and transmission chains
Introduction
On the tidyverse
The tidyverse is a collection of R packages designed for data science. Developed with high standards of coding practices and software development, these packages form a consistent ecosystem to handle and visualise data. In this practical, we will mostly rely on the following packages:
dplyr for handling data, making new variables, building summaries, number-crunching
tidyr to re-arrange tidy/data
ggplot2 to visualise data
forcats to handle categorical data (
factors)magrittr for the piping operator (
%>%)
Most of these functionalities are summarised in handy cheatsheets. We provide links to the most relevant ones below:
basics of R (not tidyverse, but they remain very useful)
Other packages
Other packages we will use include:
- rmarkdown for automated report generation
Cheatsheets and other resources:
the rmarkdown cheatsheet
the knitr website documenting many options used in rmarkdown’s settings and code chunks
On the RECONverse
Several RECON packages will be used in the practical:
incidence2 to build and visualise epidemic curves
epicontacts to visualise and analyse contact tracing data or transmission chains
epitrix to estimate delay distributions
distcrete to handle delay distributions
What we will do today
This practical will use different outbreak data to illustrate how the various packages mentioned above can be combined to explore outbreak data and carry out analyses specific to the outbreak response context. Building on material seen in previous sessions, we will particularly look at:
- building epidemic curves with various time intervals and groups
- derive empirical distributions of key delays
- fit delays to discretized Gamma distributions using Maximum Likelihood, and summarising the results
- visualise and analyse transmission chains / contact tracing data
In addition, we strongly recommend storing all analyses into one or several rmarkdown reports.
How this document works
This practical will take you through initial exploratory analyses of different datasets. Some parts will be used merely to illustrate a technical point (e.g. how to count cases by different groups), some will focus more on the epidemiological meaning of a result, and some will do both. These analyses are merely meant as a guide, and are by no means an exhaustive exploration of the data. Be curious: feel free to ask new questions and try to answer them. In all instances, try to first answer your own questions: try/error is a great way to learn R. But if you feel stuck, do not hesitate to questions to the demonstrators: they are here to help.
In this practical, most command lines are provided, but not all. In the few instances where code is not provided, you will typically need to re-use and adapt previous code to produce the result requested. In most cases, copy-paste should be sufficient to reproduce the results.
However, it is important that you understand what the code does. Look at relevant documentation, try tweaking the code, experiment, and do not hesitate to ask questions. It is not essential to go through all examples to illustrate important points, so take your time, especially if you are new to R. Also note that the solutions provided are by not means the only options for solving a given problem - feel free to explore alternatives and discuss them with the trainers.
Data: Measles in Hagelloch, Germany, 1861
In this second part, we continue the analysis of the measles outbreak which took place in Hagellock, Germany, in 1861 (Oesterle 1993 ; Neal and Roberts 2004). The linelist data is distributed in the R package outbreaks, as the dataset measles_hagelloch_1861 (see ?measles_hagelloch_1861 for more information).
We first load required packages and the data, which we save in an object x for convenience.
# load packages
library(tidyverse) # this loads all tidyverse packages
library(outbreaks) # for datasets
library(incidence2) # for epicurves
library(epicontacts) # for contacts / transmission chains
# load the data, convert to tibble and store as 'x'
x <- measles_hagelloch_1861 %>%
tibble()
# dimensions of the dataset
dim(x)
## [1] 188 12
# check the content of x
x
## # A tibble: 188 × 12
## case_ID infector date_of_prodrome date_of_rash date_of_death age gender
## <int> <int> <date> <date> <date> <dbl> <fct>
## 1 1 45 1861-11-21 1861-11-25 NA 7 f
## 2 2 45 1861-11-23 1861-11-27 NA 6 f
## 3 3 172 1861-11-28 1861-12-02 NA 4 f
## 4 4 180 1861-11-27 1861-11-28 NA 13 m
## 5 5 45 1861-11-22 1861-11-27 NA 8 f
## 6 6 180 1861-11-26 1861-11-29 NA 12 m
## 7 7 42 1861-11-24 1861-11-28 NA 6 m
## 8 8 45 1861-11-21 1861-11-26 NA 10 m
## 9 9 182 1861-11-26 1861-11-30 NA 13 m
## 10 10 45 1861-11-21 1861-11-25 NA 7 f
## # … with 178 more rows, and 5 more variables: family_ID <int>, class <fct>,
## # complications <fct>, x_loc <dbl>, y_loc <dbl>
summary(x)
## case_ID infector date_of_prodrome date_of_rash
## Min. : 1.00 Min. : 1.00 Min. :1861-10-30 Min. :1861-11-03
## 1st Qu.: 47.75 1st Qu.: 31.00 1st Qu.:1861-11-23 1st Qu.:1861-11-27
## Median : 94.50 Median : 50.50 Median :1861-12-01 Median :1861-12-05
## Mean : 94.50 Mean : 82.15 Mean :1861-11-29 Mean :1861-12-03
## 3rd Qu.:141.25 3rd Qu.:153.00 3rd Qu.:1861-12-04 3rd Qu.:1861-12-07
## Max. :188.00 Max. :188.00 Max. :1862-01-24 Max. :1862-01-27
## NA's :4
## date_of_death age gender family_ID class
## Min. :1861-11-18 Min. : 0.000 f :83 Min. : 1.00 0:90
## 1st Qu.:1861-12-12 1st Qu.: 3.000 m :94 1st Qu.:18.00 1:30
## Median :1861-12-13 Median : 7.000 NA's:11 Median :31.00 2:68
## Mean :1861-12-12 Mean : 6.989 Mean :30.95
## 3rd Qu.:1861-12-15 3rd Qu.:11.000 3rd Qu.:43.00
## Max. :1861-12-28 Max. :15.000 Max. :69.00
## NA's :176
## complications x_loc y_loc
## yes:188 Min. : 7.5 Min. : 5.0
## 1st Qu.:145.0 1st Qu.: 80.0
## Median :182.5 Median :147.5
## Mean :188.2 Mean :133.8
## 3rd Qu.:240.0 3rd Qu.:187.5
## Max. :280.0 Max. :240.0
##
# ... use glimpse and View to explore contentTry using the functions glimpse, View, summary on x. What variables does the dataset contain?
Building an epicurve
Here, we show how incidence2 can be used to build epicurves using various groups and time intervals.
In its simplest form, incidence() builds a tibble reporting daily case incidence:
# basic use of incidence
i <- x %>%
incidence(date_of_prodrome)
i
## An incidence object: 36 x 2
## date range: [1861-10-30] to [1862-01-24]
## cases: 188
## interval: 1 day
## cumulative: FALSE
##
## date_index count
## <date> <int>
## 1 1861-10-30 1
## 2 1861-11-01 1
## 3 1861-11-07 2
## 4 1861-11-08 1
## 5 1861-11-11 2
## 6 1861-11-12 1
## 7 1861-11-13 1
## 8 1861-11-15 2
## 9 1861-11-17 1
## 10 1861-11-18 1
## # … with 26 more rows
class(i)
## [1] "incidence2" "incidence_df" "tbl_df" "tbl" "data.frame"
# basic plot
plot(i)However, using different time intervals or adding groups is straightforward. For instance:
# use weekly interval and groups; note: interval can be specified as a number
# (of days)
i <- x %>% incidence(date_of_prodrome,
interval = "week",
groups = c(gender, class))
i
## An incidence object: 34 x 4
## date range: [1861-W44] to [1862-W04]
## cases: 188
## interval: 1 (Monday) week
## cumulative: FALSE
##
## date_index gender class count
## <yrwk> <fct> <fct> <int>
## 1 1861-W44 <NA> 2 1
## 2 1861-W44 m 0 1
## 3 1861-W45 f 0 1
## 4 1861-W45 f 1 1
## 5 1861-W45 m 0 1
## 6 1861-W46 f 0 1
## 7 1861-W46 f 2 1
## 8 1861-W46 m 0 2
## 9 1861-W46 m 1 1
## 10 1861-W46 m 2 2
## # … with 24 more rows
# groups can be re-arranged using `regroup`
i %>%
regroup(class)
## An incidence object: 18 x 3
## date range: [1861-W44] to [1862-W04]
## cases: 188
## interval: 1 (Monday) week
## cumulative: FALSE
##
## date_index class count
## <yrwk> <fct> <int>
## 1 1861-W44 0 1
## 2 1861-W44 2 1
## 3 1861-W45 0 2
## 4 1861-W45 1 1
## 5 1861-W46 0 3
## 6 1861-W46 1 1
## 7 1861-W46 2 3
## 8 1861-W47 0 7
## 9 1861-W47 1 27
## 10 1861-W47 2 10
## 11 1861-W48 0 17
## 12 1861-W48 1 1
## 13 1861-W48 2 36
## 14 1861-W49 0 46
## 15 1861-W49 2 17
## 16 1861-W50 0 13
## 17 1861-W50 2 1
## 18 1862-W04 0 1
# cumulative incidence can be calculated using `cumulate`
i %>%
cumulate() %>%
plot()Group information can be displayed using colors (fill argument), facets (using facet_plot), or both, so that up to 2 groups can be used; for instance:
# note that we do not save the results and make the plot on the fly
x %>% incidence(date_of_prodrome,
groups = c(gender, class)) %>%
facet_plot(facets = class,
fill = gender,
nrow = 3,
show_cases = TRUE)See ?plot.incidence2 for more plotting options.
Tips to customise graphs
Using custom colors: nice colors can go a long way towards communicating results; one way to customise colors is to use HTML color codes, for instance using this tool; all ‘hex’ codes e.g.
#777284can be used in R; also note there are professional designer tools to create consistent color palettes, such as paletton.com (see palettonr for importing these palettes)Specifying data formats: date formats in R use key words, e.g.
%dfor the digit of the day,%Bfor the full month, etc.; details on these keywords are hidden in the documentation of?strptime
Exercise: The graphs produced when plotting incidence objects are regular ggplot2 objects, so they can be customised as seen before. Use what you have learnt in previous sessions to:
- create a new variable
new_genderby recoding the gender variable intomale/female/missing - similarly, create
new_classand recode the class variable inpreschool/class_1/class_2 - define a scale for dates using day/month/year e.g. 01/12/1831 for the 1st december 1831
- use a custom color palette for the new gender variable
- repeat the previous plot using these custom scales and the new variables
Estimating delays
R is good at handling dates, and is able to count the number of days separating dates. This makes calculating delays pretty straightforward. The only ‘trick’ to bear in mind is to convert the time difference to integer, as difference between dates is originally returned as a difftime object, which can behave in unexpected ways in further operations.
We illustrate this by calculating the delay between the onset of symptoms and the appearance of a rash:
# first show how date differences work
as.Date("2020-10-03") - as.Date("1982-02-04") # I am that many days old today
## Time difference of 14121 days
temp <- as.Date("2020-10-03") - as.Date("1982-02-04")
class(temp)
## [1] "difftime"
as.integer(temp)
## [1] 14121
# apply principle to more relevant epi delay, summarise, make a simple plot
x <- x %>%
mutate(delay_rash = as.integer(date_of_rash - date_of_prodrome))
x %>%
pull(delay_rash) %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 4.000 3.947 5.000 13.000
ggplot(x, aes(x = delay_rash)) +
geom_bar(fill = "#466372") +
theme_bw() +
labs(title = "Empirical distribution of delays from onset to rash",
x = "Days from onset of symptoms to rash",
y = "Number of cases")This empirical distribution is informative, but only reflects a sample of cases. Using this data, what can we say, more generally, about the delay from onset of symptoms of measles to the onset of rash in this population?
We can try and address this question by fitting a discretised Gamma distribution to the data, using epitrix’s function fit_disc_gamma:
delay_rash_fit <- x %>%
pull(delay_rash) %>%
epitrix::fit_disc_gamma()
delay_rash_fit
## $mu
## [1] 4.446486
##
## $cv
## [1] 0.3785482
##
## $sd
## [1] 1.68321
##
## $ll
## [1] -360.0598
##
## $converged
## [1] TRUE
##
## $distribution
## A discrete distribution
## name: gamma
## parameters:
## shape: 6.97842734893307
## scale: 0.637176003492939Question: compare the results of the fitted distribution to the empirical summary. What do you notice?
Let us examine the object delay_rash_fit. Contains parameters of the distribution, the corresponding log-likelihood, indications of convergence of the fitting process, as well as a distcrete object, which is essentially a list containing 4 functions providing the typical implementation of distributions in R:
$d: probability mass function$q: quantiles$p: p-values$r: random number generator
# probabilities of delays of 0, 1, ... 15 days
delay_rash_fit$distribution$d(0:15)
## [1] 1.239697e-03 4.059707e-02 1.571532e-01 2.425611e-01 2.290083e-01
## [6] 1.600206e-01 9.135690e-02 4.512995e-02 1.999983e-02 8.145375e-03
## [11] 3.100768e-03 1.116987e-03 3.842891e-04 1.271694e-04 4.070458e-05
## [16] 1.265834e-05
# 95% qantile interval
delay_rash_fit$distribution$q(c(0.025, 0.975))
## [1] 0 7
# p-value: proba of a delay > 10 days
1 - delay_rash_fit$distribution$p(10)
## [1] 0.001687248
# generate 30 delays from the distribution
delay_rash_fit$distribution$r(30)
## [1] 5 7 5 4 4 4 3 2 3 4 4 3 3 3 3 4 3 7 4 2 6 5 6 2 0 4 3 2 3 2We can assess the overall quality of the fit by comparing the empirical and fitted distributions; note the use of complete to make sure we define probabilities for delays that were not observed in the data:
# get delays from 0 to 15 days, the corresponding observed frequencies, and the
# estimated probabilities
delay_rash_comparison <- x %>%
count(delay_rash) %>% # count delays
complete(delay_rash = 0:15, fill = list(n = 0)) %>% # fill zero where needed
mutate(prop = n / sum(n)) %>% # turn counts into proportions
mutate(proba = delay_rash_fit$distribution$d(delay_rash)) # add estimated probabilities
delay_rash_comparison
## # A tibble: 16 × 4
## delay_rash n prop proba
## <int> <dbl> <dbl> <dbl>
## 1 0 2 0.0106 0.00124
## 2 1 5 0.0266 0.0406
## 3 2 29 0.154 0.157
## 4 3 31 0.165 0.243
## 5 4 72 0.383 0.229
## 6 5 25 0.133 0.160
## 7 6 11 0.0585 0.0914
## 8 7 6 0.0319 0.0451
## 9 8 4 0.0213 0.0200
## 10 9 1 0.00532 0.00815
## 11 10 0 0 0.00310
## 12 11 1 0.00532 0.00112
## 13 12 0 0 0.000384
## 14 13 1 0.00532 0.000127
## 15 14 0 0 0.0000407
## 16 15 0 0 0.0000127
# plot the results
delay_rash_comparison %>% ggplot(aes(x = delay_rash)) +
theme_bw() +
geom_col(aes(y = prop), fill = "#466372") +
geom_line(aes(y = proba)) +
labs(title = "Empirical and estimated distributions of delays from onset to rash",
subtitle = "bars: observed frequencies; line: estimated gamma",
x = "Days from onset of symptoms to rash",
y = "Proportion / probability")Question: compare the two distributions; are you satisfied with the fit to the data?
Exercise:
use the
rfunction to generate random samples from the distribution, using the same number of cases, and examine the resulting empirical distribution usingqplot()to build quick plots; repeat this a few times; does a peak as observed in the empirical distribution seem likely?What are the expected percentages of delays greater than 10 days? Same question for 5 days?
## [1] 0.1687248 16.9420068
Estimating severity
There is no outcome variable in the data, but date of death is documented for some cases. Assuming the absence of date of death indicates survival, we can build a new variable outcome using:
is.na(), which returnsTRUEfor all missing entries of a vector, andFALSEotherwiseif_elsewhich will turnTRUEandFALSEinto other values
# create outcome variable
x <- x %>%
mutate(outcome = if_else(is.na(date_of_death), 'recovered', 'dead'))
x %>%
count(outcome)
## # A tibble: 2 × 2
## outcome n
## <chr> <int>
## 1 dead 12
## 2 recovered 176
# calculate CFR
x %>%
summarise(cfr = mean(outcome == "dead"))
## # A tibble: 1 × 1
## cfr
## <dbl>
## 1 0.0638Exercise: using the function prop_ci from the previous practical, calculate the 95% confidence interval of the case fatality ratio.
Visualising transmission chains
In this section, we will use epicontacts to visualise and analyse transmission chains. Technically, transmission chains can be represented as graphs where nodes are cases, and edges represent transmisison events. Note that the same principle applies to any kind of contacts, so that this tool can be used for contact tracing as well.
The main function make_epicontacts (see documentation ?make_epicontacts) requires two types of data inputs:
a linelist, describing features of the cases (the ‘nodes’), including a unique case identifier (number or label)
contacts, describing the transmission/contacts between these cases, with at least two columns indicating the source case (‘from’) and the recipient (‘to’); note that other columns can be used to provide information on the contact, e.g. the type of contact or the relationship between the cases
Often in epidemiological datasets, contacts information is recorded as a field in the linelist. In the present dataset, these fields are:
case_ID: the unique ID of the cases; indicates the recipient, so corresponds totoinfector: the unique ID of their infector; corresponds tofrom
Here, we first separate the linelist and contact data, and then build an epicontacts object and visualise it using plot:
# isolate information on the contacts
x_contacts <- x %>%
select(infector, case_ID)
x_contacts
## # A tibble: 188 × 2
## infector case_ID
## <int> <int>
## 1 45 1
## 2 45 2
## 3 172 3
## 4 180 4
## 5 45 5
## 6 180 6
## 7 42 7
## 8 45 8
## 9 182 9
## 10 45 10
## # … with 178 more rows
# make the epicontacts object
x_chains <- make_epicontacts(
linelist = x, # linelist data
contacts = x_contacts, # contact data
id = "case_ID", # unique ID in linelist
from = "infector", # 'from' column
to = "case_ID", # 'to' column
directed = TRUE) # transmission events are directed
x_chains
##
## /// Epidemiological Contacts //
##
## // class: epicontacts
## // 188 cases in linelist; 188 contacts; directed
##
## // linelist
##
## # A tibble: 188 × 16
## id infector date_of_prodrome date_of_rash date_of_death age gender
## <int> <int> <date> <date> <date> <dbl> <fct>
## 1 1 45 1861-11-21 1861-11-25 NA 7 f
## 2 2 45 1861-11-23 1861-11-27 NA 6 f
## 3 3 172 1861-11-28 1861-12-02 NA 4 f
## 4 4 180 1861-11-27 1861-11-28 NA 13 m
## 5 5 45 1861-11-22 1861-11-27 NA 8 f
## 6 6 180 1861-11-26 1861-11-29 NA 12 m
## 7 7 42 1861-11-24 1861-11-28 NA 6 m
## 8 8 45 1861-11-21 1861-11-26 NA 10 m
## 9 9 182 1861-11-26 1861-11-30 NA 13 m
## 10 10 45 1861-11-21 1861-11-25 NA 7 f
## # … with 178 more rows, and 9 more variables: family_ID <int>, class <fct>,
## # complications <fct>, x_loc <dbl>, y_loc <dbl>, new_gender <fct>,
## # new_class <fct>, delay_rash <int>, outcome <chr>
##
## // contacts
##
## # A tibble: 188 × 2
## from to
## <int> <int>
## 1 45 1
## 2 45 2
## 3 172 3
## 4 180 4
## 5 45 5
## 6 180 6
## 7 42 7
## 8 45 8
## 9 182 9
## 10 45 10
## # … with 178 more rows
plot(x_chains)Question: examine the interactive graph; note that you can zoom in/out using your mouse, reposition nodes by dragging them across, and see linelist data associated to cases by hovering the cursor above a node. What does this tell you about how the outbreak went on? Were there super-spreading events?
To help with the latter question, we can derive the distribution of the case reproduction numbers (R):
# list all case IDs in the data
all_ids <- x_chains %>%
get_id(which = "all") %>%
na.omit()
# count secondary cases
x %>%
count(infector) %>% # this only includes cases with R >= 1
complete(infector = all_ids, fill = list(n = 0)) %>% # add cases with R = 0
ggplot(aes(x = n)) +
geom_bar(fill = "#A04F60") +
theme_bw() +
labs(title = "Distribution of case reproduction numbers",
x = "Number of secondary cases",
y = "Frequency")By default, the plot method of epicontacts object uses vis_epicontacts to produce interactive graphs. Look at the associated documentation ?vis_epicontacts and interpret the following commands and plot:
x_chains %>%
vis_epicontacts(node_color = "new_class",
node_shape = "gender",
shapes = c("m" = "male", "f" = "female"))Question: examine the interactive graph; note that you can zoom in/out using your mouse, reposition nodes by dragging them across, and see linelist data associated to cases by hovering the cursor above a node. What does this tell you about how the outbreak went on? Were there super-spreading events?
Exercise:
- adapt the code provided above to show families instead of classes as node colors; compare the resulting plots; can you postulate which transmission events took place at school or at home?
Estimating the serial interval
Exercise:
Combine two types of analyses you have carried out so far to estimate the serial interval (delay between symptom onest of primary and secondary cases).
extract the empirical distribution of the serial interval from the transmission chains
fit this distribution to a discretised Gamma
Results should match the outputs below:
## # A tibble: 184 × 1
## si
## <int>
## 1 10
## 2 12
## 3 9
## 4 10
## 5 11
## 6 9
## 7 9
## 8 10
## 9 11
## 10 10
## # … with 174 more rows
## $mu
## [1] 10.89177
##
## $cv
## [1] 0.1500888
##
## $sd
## [1] 1.634733
##
## $ll
## [1] -352.9525
##
## $converged
## [1] TRUE
##
## $distribution
## A discrete distribution
## name: gamma
## parameters:
## shape: 44.3918886018654
## scale: 0.245355063017598
## # A tibble: 21 × 4
## si n prop proba
## <int> <dbl> <dbl> <dbl>
## 1 0 0 0 1.94e-30
## 2 1 0 0 8.42e-19
## 3 2 0 0 1.05e-12
## 4 3 0 0 7.14e- 9
## 5 4 0 0 2.80e- 6
## 6 5 0 0 1.81e- 4
## 7 6 0 0 3.37e- 3
## 8 7 4 0.0217 2.49e- 2
## 9 8 19 0.103 9.03e- 2
## 10 9 32 0.174 1.86e- 1
## # … with 11 more rows
Question:
Compare the estimated serial interval to the delay between the last two cases. What does it suggest? Using arrange() combined with desc(), sort x by decreasing date of symptom onset to identify the last case. How many days has there been between the last 2 cases? Using the estimated serial interval, calculate the probability of a delay as long as this (or longer).
Transmission chains: further investigations
To complement what the visual inspection of transmission chains, it can be useful to further characterise transmission patterns e.g. in terms of gender, age classes, or locations. This requires, for each transmission pair, looking up features of the respective cases in the linelist, and reporting them. This can be done using get_pairwise, illustrated below using the new_gender variable:
# using the basic form
x_chains %>%
get_pairwise(attribute = "new_gender") %>% # get gender->gender for each transmission
table() # summarise output
## .
## female -> female female -> male female -> missing male -> female
## 37 38 3 40
## male -> male male -> missing missing -> female missing -> male
## 51 7 4 4This is a start, but not entirely satisfying, as we should be able to build a 2x2 table to summarise these patterns. If you look at ?get_pairwise, you will find we can specify what is done with the attributes of the infector and recipient by defining the function f. This permits a more meaningful output:
# using `table` directly inside get_pairwise to build a contingency table
x_chains %>%
get_pairwise(attribute = "new_gender", f = table)
## values.to
## values.from female male missing
## female 37 38 3
## male 40 51 7
## missing 4 4 0
# repeat the above and add a Chi-square test
x_chains %>%
get_pairwise(attribute = "new_gender", f = table) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 1.9187, df = 4, p-value = 0.7507Here, the Chi-square test, which can be used to test whether two categorical variables are randomly associated (the null hypothesis), or on the contrary seem to be linked (the alternative hypothesis, in which case a low p-value is expected), suggests that there are no gender patterns in transmission.
We repeat this analysis with classes, and will add a graphical representation of the results:
# repeat the above with new_class
x_chains %>%
get_pairwise(attribute = "new_class", f = table)
## values.to
## values.from preschool class_1 class_2
## preschool 34 0 0
## class_1 41 27 19
## class_2 12 3 48
x_chains %>%
get_pairwise(attribute = "new_class", f = table) %>%
chisq.test()
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 97.706, df = 4, p-value < 2.2e-16
# build a plot using the contingency table
## note the use of `as.data.frame` to reshape the results before plotting
x_chains %>%
get_pairwise(attribute = "new_class", f = table) %>%
as.data.frame() %>%
rename(infector = values.from, infectee = values.to, n = Freq) %>%
ggplot(aes(y = infector, x = infectee)) +
theme_bw() +
geom_tile(aes(fill = n)) +
scale_fill_viridis_c(option = "inferno") +
labs(title = "Transmission patterns by class")Question: was transmission independent to the class structure? Which class seeded most infections? Which transmission routes were the most frequent?
Exercise:
- adapt the code provided above to look at family patterns as well as age patterns; the resulting graphs are produced below
Exercise:
examine the code below and try to understand what the results represent
considering these results together with the ones above, what can you say about patterns of transmissions in this Measles outbreak?
prop_same <- function(a, b) {
mean(a == b, na.rm = TRUE)
}
x_chains %>%
get_pairwise(attribute = "new_class",
f = prop_same)
## [1] 0.5923913
x_chains %>%
get_pairwise(attribute = "age",
f = prop_same)
## [1] 0.1630435
x_chains %>%
get_pairwise(attribute = "family_ID",
f = prop_same)
## [1] 0.4184783Exercise (advanced):
examine the code below and interpret the results; what kind of statistical method is used here? what does it show?
apply the same analysis using the class; what are your final conclusions?
prop_same_perm <- function(a, b) {
perm_a <- sample(a, replace = FALSE)
mean(perm_a == b, na.rm = TRUE)
}
prop_same_fam <- x_chains %>%
get_pairwise(attribute = "family_ID",
f = prop_same)
prop_same_fam_ref <- sapply(
1:999,
function(i)
get_pairwise(x_chains,
attribute = "family_ID",
f = prop_same_perm)
)
prop_same_fam / mean(prop_same_fam_ref)
## [1] 21.39722
qplot(prop_same_fam_ref, geom = "histogram") +
theme_bw() +
geom_vline(xintercept = prop_same_fam, color = "#E03F57", size = 1) +
labs(x = "Proportion of transmission within the same family",
y = "Frequency")References
Neal, Peter J, and Gareth O Roberts. 2004. “Statistical Inference and Model Selection for the 1861 Hagelloch Measles Epidemic.” Biostatistics 5 (2): 249–61.
Oesterle, Heike. 1993. “Statistische Reanalyse Einer Masernepidemie 1861 in Hagelloch.” PhD thesis, Uitgever niet vastgesteld.